IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Sumedha Arya
DOI Link: https://doi.org/10.22214/ijraset.2026.77501
Financial data is highly sensitive and changes constantly, which makes accurate and reliable information retrieval essential in financial question-answering systems. These systems must provide precise answers using correct and up-to-date information from sources such as financial news, reports, and regulatory filings. Traditional retrieval systems usually search with a single query against a single database. This approach is inadequate for finance, because financial documents are complex, detailed, and spread across many years; a single query may miss important information or retrieve incomplete results, which in turn leads to hallucination. To address this problem, we use a framework that combines a financial Retrieval-Augmented Generation (RAG) system with agentic AI and the Multi-HyDE method. The framework is tested with two different LLMs, MistralAI and Qwen2. Both models achieved a faithfulness score of 1.0, demonstrating the framework's ability to reduce hallucinations in LLMs.
Large Language Models (LLMs) such as OpenAI’s GPT-4, Meta’s LLaMA, and Google’s PaLM have significantly advanced Natural Language Processing (NLP). They demonstrate strong contextual understanding, reasoning ability, and human-like text generation, even with minimal examples (few-shot learning). As a result, LLMs are increasingly used in high-stakes domains such as healthcare, law, and finance.
However, LLMs suffer from hallucination, meaning they may generate incorrect or fabricated information. In finance, such errors can cause financial losses, reputational damage, and regulatory issues.
To reduce hallucinations, researchers introduced Retrieval-Augmented Generation (RAG). In RAG:
Relevant documents are retrieved from an external database.
The LLM generates answers based on these real documents.
This improves factual accuracy and reliability.
Enhancements to RAG include:
Improved embeddings for better semantic search
Hybrid retrieval (combining dense semantic search and sparse keyword search like BM25)
Hypothetical Document Embeddings (HyDE), where the LLM generates a hypothetical answer first, embeds it, and retrieves similar real documents
A more advanced approach, Agentic RAG, enables the LLM to function as an intelligent agent that:
Breaks complex questions into smaller steps
Retrieves information iteratively
Uses tools (e.g., calculators)
Verifies results before generating the final answer
This is especially useful in finance, where questions may require analyzing multiple reports and numerical data.
Financial QA systems must handle:
Long annual reports and regulatory filings
Earnings transcripts and financial news
Numerical precision and time-sensitive data
Regulatory compliance requirements
Even minor numerical errors can have serious consequences.
Existing financial systems (e.g., FinRobot, FinSage) improve performance but still face challenges in retrieval accuracy, disambiguation, and evaluation reliability. Benchmarks such as FinanceBench show that even advanced LLMs struggle significantly with financial accuracy.
The research proposes a Financial RAG framework integrating three components:
Multi-HyDE: generates multiple hypothetical queries instead of one, improving retrieval diversity and accuracy without increasing computational cost.
Hybrid Retrieval: dense retrieval using FAISS (semantic similarity) combined with sparse retrieval using BM25 (keyword matching), with combined ranking for better coverage.
Agentic Reasoning System: step-by-step reasoning that calls retrieval and calculation tools and reduces hallucination by grounding answers in evidence.
Dataset: ~49,637 real financial news records (2003–2020) from Kaggle
Cleaned and preprocessed without synthetic data
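For illustration, a minimal preprocessing sketch in Python is shown below. The file name and column names ("date", "headline") are assumptions made for the example; the actual Kaggle dataset schema may differ.

```python
import pandas as pd

# Hypothetical file and column names; the real Kaggle dataset schema may differ.
df = pd.read_csv("financial_news_2003_2020.csv")

# Basic cleaning: drop empty and duplicate records, normalise whitespace, parse dates.
df = df.dropna(subset=["headline"]).drop_duplicates(subset=["headline"])
df["headline"] = df["headline"].str.strip().str.replace(r"\s+", " ", regex=True)
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.dropna(subset=["date"])

print(f"{len(df)} cleaned records, {df['date'].min().year}-{df['date'].max().year}")
```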
Two LLMs were tested:
Mistral-7B-Instruct-v0.3
Qwen2-7B-Instruct
Both used:
Hugging Face Transformers
Low-temperature generation for deterministic output
all-MiniLM-L6-v2 for embeddings (384-dimensional vectors)
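A minimal sketch of this setup is shown below, assuming the Hugging Face model IDs mistralai/Mistral-7B-Instruct-v0.3 and sentence-transformers/all-MiniLM-L6-v2; the decoding settings are illustrative rather than the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer

# Generator LLM; Qwen/Qwen2-7B-Instruct can be swapped in the same way.
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding, i.e. effectively zero-temperature output
)

# Embedding model producing 384-dimensional sentence vectors.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = embedder.encode(["DCB Bank Q4 FY20 results"], normalize_embeddings=True)
print(vectors.shape)  # (1, 384)
```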
Dense Index: FAISS with normalized embeddings
Sparse Index: BM25 with tokenized text
Combining the two improves retrieval of both semantic meaning and exact financial terms
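The sketch below illustrates one way to combine the two indexes, using the same sentence-transformers embedder and the rank_bm25 package; the weighted score fusion is an illustrative choice, not necessarily the paper's exact ranking scheme.

```python
import numpy as np
import faiss
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "DCB Bank's Q4 FY20 profit before tax fell 37.6% to Rs 93.84 crore.",
    "The RBI kept the repo rate unchanged in its latest policy review.",
]

# Dense index: with normalized embeddings, inner product equals cosine similarity.
emb = embedder.encode(docs, normalize_embeddings=True).astype("float32")
dense_index = faiss.IndexFlatIP(emb.shape[1])
dense_index.add(emb)

# Sparse index: BM25 over whitespace-tokenized, lower-cased text.
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query, k=2, alpha=0.5):
    """Fuse dense and sparse scores with a simple weighted sum (illustrative)."""
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = dense_index.search(q, len(docs))
    dense = np.zeros(len(docs))
    dense[ids[0]] = scores[0]
    sparse = np.array(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() + 1e-9)  # scale keyword scores to [0, 1]
    fused = alpha * dense + (1 - alpha) * sparse
    return [docs[i] for i in np.argsort(-fused)[:k]]

print(hybrid_search("DCB Bank profit before tax Q4 FY20", k=1))
```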
The Multi-HyDE retrieval pipeline (sketched after this list):
Generate multiple query variants
Create hypothetical answers for each
Embed and retrieve documents via dense + sparse search
Rank and combine results
This reduces semantic mismatch between short user queries and long financial documents.
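A simplified sketch of this pipeline follows; the prompts, the number of query variants, and the generate_fn/search_fn hooks are illustrative assumptions rather than the paper's exact implementation.

```python
def multi_hyde_retrieve(question, generate_fn, search_fn, n_variants=3, k=5):
    """Multi-HyDE: rewrite the question several ways, draft a hypothetical
    answer for each variant, retrieve with every hypothetical, merge results."""
    # 1) Ask the LLM for paraphrased variants of the user question.
    prompt = (f"Rewrite the question below in {n_variants} different ways, "
              f"one per line:\n{question}")
    variants = [question] + [v for v in generate_fn(prompt).splitlines() if v.strip()][:n_variants]

    # 2) For each variant, generate a short hypothetical answer (the HyDE step).
    hypotheticals = [generate_fn(f"Write a brief, plausible answer to: {v}") for v in variants]

    # 3) Retrieve with every hypothetical document and merge, de-duplicating.
    merged, seen = [], set()
    for hypo in hypotheticals:
        for doc in search_fn(hypo, k=k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:k]

# Usage, assuming `generator` and `hybrid_search` from the earlier sketches:
# answer_docs = multi_hyde_retrieve(
#     "How did DCB Bank's Q4 FY20 profit before tax change?",
#     lambda p: generator(p)[0]["generated_text"],
#     hybrid_search,
# )
```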
The agentic reasoning system (sketched below):
Defines tools: retrieval and calculation
LLM follows structured reasoning steps (THOUGHT → ACTION → OBSERVE → ANSWER)
Limits steps to avoid infinite loops
Ensures responses are evidence-based
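The sketch below shows a bounded loop of this kind; the tool names, prompt format, and output parsing are assumptions for illustration, not the paper's exact agent.

```python
import re

def agent_answer(question, llm, tools, max_steps=5):
    """Run a bounded THOUGHT -> ACTION -> OBSERVE -> ANSWER loop.

    `llm` maps a prompt string to a completion; `tools` maps tool names
    (e.g. "retrieve", "calculate") to callables that take a string argument.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):  # hard step limit prevents infinite loops
        step = llm(transcript + "\nContinue with a THOUGHT, then either "
                                "ACTION: tool[input] or ANSWER: <final answer>.")
        transcript += step + "\n"
        final = re.search(r"ANSWER:\s*(.*)", step, re.S)
        if final:
            return final.group(1).strip()  # evidence already sits in the transcript
        action = re.search(r"ACTION:\s*(\w+)\[(.*?)\]", step, re.S)
        if action and action.group(1) in tools:
            observation = tools[action.group(1)](action.group(2))
            transcript += f"OBSERVE: {observation}\n"
    return "No answer found within the step limit."

# Usage, assuming the earlier sketches and a hypothetical safe calculator tool:
# tools = {"retrieve": lambda q: " | ".join(hybrid_search(q)),
#          "calculate": safe_calculator}
```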
Faithfulness verification (sketched below):
Extract factual claims from generated answers
Check whether retrieved documents support each claim
Compute a faithfulness score (0.0–1.0)
A score of 1.0 indicates all claims are supported by retrieved evidence.
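A simplified sketch of how such a score can be computed is shown below, using the LLM itself to extract and verify claims; the prompts are illustrative and this is not the exact procedure of RAGAS or of the paper.

```python
def faithfulness_score(answer, retrieved_docs, llm):
    """Fraction of claims in the answer supported by the retrieved evidence."""
    # 1) Break the generated answer into atomic factual claims.
    claims_text = llm("List each factual claim in the following answer, "
                      "one per line:\n" + answer)
    claims = [c.strip() for c in claims_text.splitlines() if c.strip()]
    if not claims:
        return 0.0

    # 2) Ask the LLM whether each claim is supported by the evidence.
    context = "\n".join(retrieved_docs)
    supported = 0
    for claim in claims:
        verdict = llm(f"Evidence:\n{context}\n\nClaim: {claim}\n"
                      "Is the claim fully supported by the evidence? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            supported += 1

    # 3) A score of 1.0 means every claim is grounded in the retrieved documents.
    return supported / len(claims)
```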
Testing the framework on both Mistral and Qwen2 models showed:
Faithfulness score of 1.0
Strong reduction in hallucinations
Accurate, grounded financial question answering
Reliable performance without excessive computational cost
Overall, both the Mistral and Qwen2 Financial RAG systems performed very well. They correctly identified that DCB Bank’s profit before tax declined by 37.6% to ₹93.84 crore in Q4 FY20, announced in May 2020, and achieved perfect faithfulness scores. The Qwen2 model showed slightly cleaner reasoning and marginally better retrieval alignment. However, both models proved that 7B-scale open language models, when combined with Multi-HyDE retrieval, hybrid search, agentic reasoning, and verification, can produce reliable and grounded financial news answers. This confirms that the overall architecture is strong and suitable for building trustworthy financial question-answering systems on medium-sized historical datasets.
[1] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[2] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection," arXiv preprint arXiv:2310.11511, 2023.
[3] C.-Y. Chang, Z. Jiang, V. Rakesh, M. Pan, C.-C. M. Yeh, G. Wang, M. Hu, Z. Xu, Y. Zheng, M. Das, and N. Zou, "MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation," arXiv preprint arXiv:2501.00332, 2024.
[4] Z. Chen, S. Li, C. Smiley, Z. Ma, S. Shah, and W. Y. Wang, "ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering," arXiv preprint arXiv:2210.03849, 2022.
[5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, and others, "PaLM: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[6] M. Eibich, S. Nagpal, and A. Fred-Ojala, "ARAGOG: Advanced RAG Output Grading," arXiv preprint arXiv:2404.01037, 2024.
[7] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, "RAGAS: Automated Evaluation of Retrieval Augmented Generation," arXiv preprint arXiv:2309.15217, 2023.
[8] L. Gao, X. Ma, J. Lin, and J. Callan, "Precise Zero-Shot Dense Retrieval without Relevance Labels," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1762–1777, Toronto, Canada, Association for Computational Linguistics, 2023.
[9] T. Gao, X. Yao, and D. Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, 2021.
[10] S. Girhepuje, S. S. Sajeev, P. Jain, A. Sikder, A. R. Varma, R. George, A. G. Srinivasan, M. Kurup, A. Sinha, and S. Mondal, "RE-GAINS & EnChAnT: Intelligent Tool Manipulation Systems For Enhanced Query Responses," arXiv preprint arXiv:2401.15724, 2024.
[11] Z. Guo, L. Xia, Y. Yu, T. Ao, and C. Huang, "LightRAG: Simple and Fast Retrieval-Augmented Generation," arXiv preprint arXiv:2410.05779, 2024.
[12] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, "REALM: Retrieval-Augmented Language Model Pre-Training," arXiv preprint arXiv:2002.08909, 2020.
[13] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, "Reasoning with Language Model is Planning with World Model," arXiv preprint arXiv:2305.14992, 2023.
[14] P. Henderson, K. Sinha, N. Angelard-Gontier, N. R. Ke, G. Fried, R. Lowe, and J. Pineau, "Foundation models for legal reasoning," arXiv preprint arXiv:2307.03557, 2023.
[15] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," in International Conference on Learning Representations, 2021.
[16] Y. Huang, X. Sun, Y. Xiong, Z. Dou, G. Zhang, and J. Yuan, "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," arXiv preprint arXiv:2311.05232, 2023.
[17] P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen, "FinanceBench: A New Benchmark for Financial Question Answering," arXiv preprint arXiv:2311.11944, 2023.
[18] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, 2023.
[19] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, "Dense passage retrieval for open-domain question answering," arXiv preprint arXiv:2004.04906, 2020.
[20] O. Khattab and M. Zaharia, "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48, 2020.
[21] LangChain, "Query Transformations," 2023.
[22] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, and others, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[23] Z. Li, J. Wang, Z. Jiang, H. Mao, Z. Chen, J. Du, Y. Zhang, F. Zhang, D. Zhang, and Y. Liu, "DMQR-RAG: Diverse multi-query rewriting for RAG," arXiv preprint arXiv:2411.13154, 2024.
[24] Z. Li, H. Wang, Z. Chen, and X. Chen, "FinBERT: A pre-trained financial language representation model for financial text mining," in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2023.
[25] Y. Liu, Y. Xie, C. Chen, S. Wang, Y. Yuan, Y. Liu, X. Hu, S. Wang, T. Qiao, L. Pan, and others, "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," arXiv preprint arXiv:2307.16789, 2023.
[26] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, and 262 others, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2024.
[27] Y. Qin, S. Deng, F. Xu, S. Chen, Y. Lin, W. Sun, M. Bu, P. Li, S. Zhou, C. Yang, and others, "Tool learning with foundation models," arXiv preprint arXiv:2304.08354, 2023.
[28] A. Radhakrishnan, K. Nguyen, A. Chen, C. Chen, C. Denison, D. Hernandez, E. Durmus, E. Hubinger, J. Kernion, K. Lukošiūtė, N. Cheng, N. Joseph, N. Schiefer, O. Rausch, S. McCandlish, S. El Showk, T. Lanham, T. Maxwell, V. Chandrasekaran, and 5 others, "Question decomposition improves the faithfulness of model-generated reasoning," arXiv preprint arXiv:2307.11768, 2023.
[29] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.
[30] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," arXiv preprint arXiv:2302.04761, 2023.
[31] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Venkataraman, G. Maginnis, A. Nori, and others, "Large language models in medicine," Nature Medicine, vol. 29, no. 8, pp. 1998–2012, 2023.
[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, and others, "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[33] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim, "Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models," arXiv preprint arXiv:2305.04091, 2023.
[34] X. Wang, J. Chi, Z. Tai, T. S. T. Kwok, M. Li, Z. Li, H. He, Y. Hua, P. Lu, S. Wang, Y. Wu, J. Huang, J. Tian, F. Mo, Y. Cui, and L. Zhou, "FinSage: A multi-aspect RAG system for financial filings question answering," arXiv preprint arXiv:2504.14493, 2025.
[35] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, "BloombergGPT: A large language model for finance," arXiv preprint arXiv:2303.17564, 2023.
[36] Y. Wu, T. Yue, S. Zhang, C. Wang, and Q. Wu, "StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows," arXiv preprint arXiv:2403.11322, 2024.
[37] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk, "Approximate nearest neighbor negative contrastive learning for dense text retrieval," arXiv preprint arXiv:2007.00808, 2020.
[38] S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, "Corrective Retrieval Augmented Generation," arXiv preprint arXiv:2401.15884, 2024.
[39] H. Yang, B. Zhang, N. Wang, C. Guo, X. Zhang, L. Lin, J. Wang, T. Zhou, M. Guan, R. Zhang, and C. D. Wang, "FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models," arXiv preprint arXiv:2405.14767, 2024.
[40] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[41] B. Zhang, J. Shlens, and J. Dean, "Designing effective sparse expert models," arXiv preprint arXiv:2202.08906, 2022.
[42] L. Zhang, Y. Wu, Q. Yang, and J.-Y. Nie, "Exploring the best practices of query expansion with large language models," arXiv preprint arXiv:2401.06311, 2024.
[43] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi, "Least-to-most prompting enables complex reasoning in large language models," arXiv preprint arXiv:2205.10625, 2023.
Copyright © 2026 Sumedha Arya. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET77501
Publish Date : 2026-02-16
ISSN : 2321-9653
Publisher Name : IJRASET
